Clinical Corpus Annotation: Challenges and Strategies
Authors
Abstract
Annotation is an important task for Natural Language Processing (NLP), and the traditional annotation approach, which includes writing detailed guidelines and training annotators, has worked well in many previous annotation projects. However, making medical judgments on clinical data requires medical expertise, so such annotation can only be done by experts. Recently, we created three corpora for our clinical NLP studies: one marks critical recommendations in radiology reports, and the other two indicate whether a patient has pneumonia based on chest X-ray reports or ICU reports. All the annotations were done by medical experts. In this paper, we discuss the challenges we encountered when working with expert annotators and lay out the lessons we learned from the annotation tasks. Our experiments show that medical training alone is not sufficient for achieving high inter-annotator agreement, and that NLP researchers should get involved in the annotation process as early as possible despite their lack of medical training.
Similar resources
Languages under the influence: Building a database of Uralic languages
For most of the Uralic languages, there is a lack of systematically collected, consequently transcribed and morphologically annotated text corpora. This paper sums up the steps, the preliminary results and the future directions of building a linguistic corpus of some Uralic languages, namely Tundra Nenets, Udmurt, Synya Khanty, and Surgut Khanty. The experiences of building a corpus containing ...
Are We There Yet?: The Development of a Corpus Annotated for Social Acts in Multilingual Online Discourse
We present the AAWD and AACD corpora, a collection of discussions drawn from Wikipedia talk pages and small group IRC discussions in English, Russian and Mandarin. Our datasets are annotated with labels capturing two kinds of social acts: alignment moves and authority claims. We describe these social acts, discuss our annotation process, highlight challenges we encountered and strategies we emp...
Challenges in the development of annotated corpora of computer-mediated communication in Indian Languages: A Case of Hindi
The present paper describes an ongoing effort to compile and annotate a large corpus of computer-mediated communication (CMC) in Hindi. It describes the process of the compilation of the corpus, the basic structure of the corpus and the annotation of the corpus and the challenges faced in the creation of such a corpus. It also gives a description of the technologies developed for the processing...
Challenges in Automating Maze Detection
SALT is a widely used annotation approach for analyzing natural language transcripts of children. Nine annotated corpora are distributed along with scoring software to provide norming data. We explore automatic identification of mazes – SALT’s version of disfluency annotations – and find that cross-corpus generalization is very poor. This surprising lack of cross-corpus generalization suggests s...
A Methodology for Corpus Annotation through Crowdsourcing
In contrast to expert-based annotation, for which elaborate methodologies ensure high quality output, currently no systematic guidelines exist for crowdsourcing annotated corpora, despite the increasing popularity of this approach. To address this gap, we define a crowd-based annotation methodology, compare it against the OntoNotes methodology for expert-based annotation, and identify future ch...
Publication date: 2012